Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: use podman instead of docker #207

Merged
merged 1 commit into from
Sep 11, 2018

Conversation

ashcrow
Copy link
Member

@ashcrow ashcrow commented Sep 5, 2018

NOTE: This is untested ... but in theory it should work 😄

The following items were left alone:

  • jenkins tests
  • ./modules/ignition/resources/services/kubelet.service

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 5, 2018
@crawford
Copy link
Contributor

crawford commented Sep 5, 2018

ok to test

@mrunalp
Copy link
Member

mrunalp commented Sep 5, 2018

@mheon @baude @rhatdan PTAL

Copy link

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@@ -10,5 +10,5 @@ jobs today for testing purposes.
Example:

```sh
docker build -t quay.io/coreos/tectonic-builder:v1.33 -f images/tectonic-builder/Dockerfile .
podman build -t quay.io/coreos/tectonic-builder:v1.33 -f images/tectonic-builder/Dockerfile .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One day it should be called 'PodManfile' instead of Dockerfile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One day it should be called 'PodManfile' instead of Dockerfile.

They're going for backwards-compat. More on this in containers/buildah#851.

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 5, 2018
@baude
Copy link

baude commented Sep 5, 2018

LGTM, what could go wrong? :)

builder-run Outdated
@@ -1,6 +1,6 @@
#!/bin/bash -e
#
# A utility script for executing arbitrary local scripts via docker.
# A utility script for executing arbitrary local scripts via podman.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script is from coreos/tectonic-installer#609, but doesn't seem to have ever been used anywhere. With installer/scripts gone since coreos/tectonic-installer#3067, I'd guess we can just drop builder-run entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to drop whatever we can :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'd guess we can just drop builder-run entirely.

I've spun this off into #229.

@@ -4,7 +4,7 @@ set -e
echo "Rendering Kubernetes core manifests..."

# shellcheck disable=SC2154
/usr/bin/docker run \
/usr/bin/podman run \
Copy link
Member

@wking wking Sep 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks Container Linux, no? I'll launch the smoke tests to see... [edit: @crawford already launched them :)]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. We'll need to remove CL support before we can merge this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#221 is ready to go (once openshift/release#1317 merges).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashcrow can you also update bootkube.service so that it no longer depends on docker.service?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will update shortly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⬆️

@wking wking mentioned this pull request Sep 5, 2018
# See https://github.com/kubernetes/kubernetes/issues/43292

echo "Starting etcd certificate signer..."

# shellcheck disable=SC2154
SIGNER=$(/usr/bin/docker run -d \
SIGNER=$(/usr/bin/podman run -d \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman has an issue with --tmpfs I think

Sep 05 19:13:13 dev-bootstrap bash[1448]: container create failed: container_linux.go:336: starting container process caused "process_linux.go:399: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/containers/storage/overlay-containers/47c973458ee9cfdbc09dc50511ea73f7e03f89f21c5d53cab69b11b46f812d99/userdata/tmpfs\\\" to rootfs \\\"/var/lib/containers/storage/overlay/a5858d7251a0798e7a86412cc38865b6526d9634334c641f0dc46da03d70e675/merged\\\" at \\\"/tmp/runctmpdir916880328\\\" caused \\\"tmpcopyup: failed to move mount /tmp/runctmpdir916880328 to /var/lib/containers/storage/overlay/a5858d7251a0798e7a86412cc38865b6526d9634334c641f0dc46da03d70e675/merged/tmp: invalid argument\\\"\""
Sep 05 19:13:13 dev-bootstrap bash[1448]: : internal libpod error

@mrunalp @mheon

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containers/podman#1413 should work around it

There is probably an underlying runc bug we should triage and fix, but this should unblock you

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lsm5 We need a new runc and @ashcrow we need it in RHCOS :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp sure. Let me know where we can snag it and we can make sure it makes it in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current versions in RHCOS build 5165

# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://rhcos:openshift/3.10/x86_64/os
                   Version: 4.0.5165 (2018-09-05 10:06:59)
                    Commit: 119e6fb7d7fa12d8685dd5e641bf23b56a3a76e81b7394c055f51d0388edf239

# rpm-ostree db list 119e6fb7d7fa12d8685dd5e641bf23b56a3a76e81b7394c055f51d0388edf239 | grep -e runc -e cri-o -e podman
 cri-o-1.11.2-1.git3eac3b2.el7.x86_64
 podman-0.7.3-1.git0791210.el7.x86_64
 runc-1.0.0-37.rc5.dev.gitad0f525.el7.x86_64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if something needs to be bumped, and where to get it if it is in a non-standard location.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp is this build available in the right places? Where is that being tracked? What BZ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lsm5 is on it and we should have it today.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Copy link
Member

wking commented Sep 5, 2018

/lgtm cancel

We know why this is failing and don't want the bots banging away on retests.

@wking
Copy link
Member

wking commented Sep 11, 2018

With #221 and #229 landed, this needs a rebase. And there shouldn't be too much in flight with conflicts, so now would be a good time. On the other hand, no rush either :p.

@ashcrow
Copy link
Member Author

ashcrow commented Sep 11, 2018

@wking rebased

@eparis
Copy link
Member

eparis commented Sep 11, 2018

rebased, but still doesn't work, right? podman is still busted?

# rpm -q runc podman cri-o
runc-1.0.0-37.rc5.dev.gitad0f525.el7.x86_64
podman-0.7.3-1.git0791210.el7.x86_64
cri-o-1.11.2-1.git3eac3b2.el7.x86_64
# cat /etc/os-release | grep PRETTY
PRETTY_NAME="Red Hat CoreOS 4.0.5459"
# journalctl | grep tmpfs
Sep 11 13:30:53 test1-bootstrap bash[892]: container create failed: container_linux.go:336: starting container process caused "process_linux.go:399: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/containers/storage/overlay-containers/b7926b6a92bae8627cac25bef3928b6992a5403b7498d0b1c25d90276f5de226/userdata/tmpfs\\\" to rootfs \\\"/var/lib/containers/storage/overlay/ffdb77fd6c4fd30a948b78abe677fd26237c0e2be0adba8baa89c2d806631b37/merged\\\" at \\\"/tmp/runctmpdir145219212\\\" caused \\\"tmpcopyup: failed to move mount /tmp/runctmpdir145219212 to /var/lib/containers/storage/overlay/ffdb77fd6c4fd30a948b78abe677fd26237c0e2be0adba8baa89c2d806631b37/merged/tmp: invalid argument\\\"\""

@mheon
Copy link

mheon commented Sep 11, 2018

Looks like your runc is still too old - per @mrunalp should hopefully be fixed sometime today?

@ashcrow
Copy link
Member Author

ashcrow commented Sep 11, 2018

That's correct. I got the new location this morning and am verifying we can switch.

@eparis
Copy link
Member

eparis commented Sep 11, 2018

We have another/different failure....

Sep 11 13:48:55 localhost.localdomain systemd[1]: Starting Bootstrap a Kubernetes cluster...
Sep 11 13:48:55 localhost.localdomain bash[901]: Rendering Kubernetes core manifests...
Sep 11 13:48:56 localhost.localdomain bash[901]: Trying to pull quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1...Failed
Sep 11 13:48:56 localhost.localdomain bash[901]: unable to find image: unable to pull quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1: unable to find image in the registries defined in "/etc/containers/registries.conf"

But then podman pull worked just fine... So why was podman incapable of pulling it the first time?

@mheon
Copy link

mheon commented Sep 11, 2018

Can you drop a --log-level=debug on the pull that's failing?

@eparis
Copy link
Member

eparis commented Sep 11, 2018

-- Logs begin at Tue 2018-09-11 14:21:04 UTC, end at Tue 2018-09-11 14:24:52 UTC. --
Sep 11 14:22:13 localhost.localdomain systemd[1]: Starting Bootstrap a Kubernetes cluster...
Sep 11 14:22:13 localhost.localdomain bash[881]: Rendering Kubernetes core manifests...
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="[graphdriver] trying provided driver "overlay""
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="overlay: override_kernelcheck=true"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="overlay test mount with multiple lowers succeeded"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="backingFs=xfs, projectQuotaSupported=false, useNativeDiff=true"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=info msg="Found CNI network crio-bridge (type=bridge) at /etc/cni/net.d/100-crio-bridge.conf"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=info msg="Found CNI network 200-loopback.conf (type=loopback) at /etc/cni/net.d/200-loopback.conf"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=info msg="Found CNI network podman (type=bridge) at /etc/cni/net.d/87-podman-bridge.conflist"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="parsed reference into "[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.override_kernel_check=true]quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1""
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="reference "[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.override_kernel_check=true]quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1" does not resolve to an image ID"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="parsed reference into "[overlay@/var/lib/containers/storage+/var/run/containers/storage:overlay.override_kernel_check=true]quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1""
Sep 11 14:22:13 localhost.localdomain bash[881]: Trying to pull quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1...time="2018-09-11T14:22:13Z" level=debug msg="Using registries.d directory /etc/containers/registries.d for sigstore configuration"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg=" Using "default-docker" configuration"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg=" No signature storage configuration found for quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="Looking for TLS certificates and private keys in /etc/docker/certs.d/quay.io"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="GET https://quay.io/v2/"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="Ping https://quay.io/v2/ err &url.Error{Op:"Get", URL:"https://quay.io/v2/", Err:(*net.OpError)(0xc42006b4a0)}"
Sep 11 14:22:13 localhost.localdomain bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="GET https://quay.io/v1/_ping"
Sep 11 14:22:13 test1-bootstrap bash[881]: time="2018-09-11T14:22:13Z" level=debug msg="Ping https://quay.io/v1/_ping err &url.Error{Op:"Get", URL:"https://quay.io/v1/_ping", Err:(*net.OpError)(0xc420402550)}"
Sep 11 14:22:13 test1-bootstrap bash[881]: Failed
Sep 11 14:22:14 test1-bootstrap bash[881]: time="2018-09-11T14:22:14Z" level=error msg="unable to find image: unable to pull quay.io/coreos/kube-core-renderer-dev:0a24db2288f00b10ced358d9643debd601ffd0f1: unable to find image in the registries defined in "/etc/containers/registries.conf""

@ashcrow ashcrow mentioned this pull request Sep 11, 2018
2 tasks
@mheon
Copy link

mheon commented Sep 11, 2018

Looks like a problem pinging the registry, but those error messages aren't being printed properly - probably needs a fix on our end

@mheon
Copy link

mheon commented Sep 11, 2018

Correction, the error printing is in c/image - I'll see if it can't be improved

@baude
Copy link

baude commented Sep 11, 2018

i think this is all fixed in newer versions of podman @mheon

@mheon
Copy link

mheon commented Sep 11, 2018

@baude Negative on the c/image error - it looks like a network flake, but the error message is honestly terrible and needs to be fixed

@eparis
Copy link
Member

eparis commented Sep 11, 2018

I'm very interested to know what the network error was. As I added ip a before the podman run and at least the iterface is set up in time... (It fails for me about 1 in 4 tries)

I didn't have this problem with docker, but it could be that docker is just slower and so it works...

@crawford crawford changed the title Use podman where possible *: use podman instead of docker Sep 11, 2018
@crawford
Copy link
Contributor

Updated with a few fixes after testing this locally.

@crawford
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 11, 2018
Copy link
Contributor

@abhinavdahiya abhinavdahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, ashcrow, crawford, rajatchopra

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,crawford,rajatchopra]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

--rm \
--network host \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why you need this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't able to access the outside network without. I thought I heard that this was expected behavior with podman (though, I don't know where I heard that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should have worked....

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman should work fine without requiring --network=host. What version of podman were you testing with?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhatdan podman version 0.7.3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman version 0.7.3

If this is a known issue with an upstream fix, we may be able to drop this after #253 takes us up to RHCOS 4.0.5595-1 with:

$ curl -s http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/4.0.5595-1/pkglist.txt | grep podman
 podman-0.8.5-2.gitdc5a711.el7.x86_64

Thank you openshift/os#288 :).

@wking
Copy link
Member

wking commented Sep 11, 2018

Do we want to update the remaining docker instances as well?

$ git log --oneline -1 origin/pr/207
c234fc3 *: use podman instead of docker
$ git grep -i docker c234fc3 | grep -v vendor/
c234fc3:config.tf:This is known to be a "Docker pull secret" as produced by the docker login [1] command.
c234fc3:config.tf:[1] https://docs.docker.com/engine/reference/commandline/login/
c234fc3:examples/tectonic.aws.yaml:# This is known to be a "Docker pull secret" as produced by the docker login [1] command.
c234fc3:examples/tectonic.aws.yaml:# [1] https://docs.docker.com/engine/reference/commandline/login/
c234fc3:examples/tectonic.libvirt.yaml:# This is known to be a "Docker pull secret" as produced by the docker login [1] command.
c234fc3:examples/tectonic.libvirt.yaml:# [1] https://docs.docker.com/engine/reference/commandline/login/
c234fc3:hack/go-fmt.sh:    docker.io/openshift/origin-release:golang-1.10 \
c234fc3:hack/go-lint.sh:    docker.io/openshift/origin-release:golang-1.10 \
c234fc3:hack/go-vet.sh:    docker.io/openshift/origin-release:golang-1.10 \
c234fc3:images/installer-origin-release/Dockerfile.ci:# This Dockerfile is used as a CI job to build smoke tests for OpenShift installer
c234fc3:images/tectonic-installer/Dockerfile:# Docker build does not follow symlinks (see
c234fc3:images/tectonic-installer/Dockerfile:# from bazel-bin/tectonic.tar.gz to the docker build context folder of your
c234fc3:images/tectonic-installer/Dockerfile.ci:# This Dockerfile is a used as a CI job to validate Tectonic installer on CentOS
c234fc3:installer/pkg/validate/validate.go:		return errors.New("overlaps with default Docker Bridge subnet (172.17.0.0/16)")
c234fc3:installer/pkg/validate/validate_test.go:		{"172.17.1.2/20", "overlaps with default Docker Bridge subnet (172.17.0.0/16)"},
c234fc3:modules/bootkube/resources/manifests/pull.json:  "type": "kubernetes.io/dockerconfigjson",
c234fc3:modules/bootkube/resources/manifests/pull.json:    ".dockerconfigjson": "${pull_secret}"
c234fc3:modules/tectonic/resources/manifests/ingress/pull.json:  "type": "kubernetes.io/dockerconfigjson",
c234fc3:modules/tectonic/resources/manifests/ingress/pull.json:    ".dockerconfigjson": "${pull_secret}"
c234fc3:modules/tectonic/resources/manifests/secrets/pull.json:  "type": "kubernetes.io/dockerconfigjson",
c234fc3:modules/tectonic/resources/manifests/secrets/pull.json:    ".dockerconfigjson": "${pull_secret}"
c234fc3:tests/run.sh:# In future bazel build could be extracted to another job which could be running in docker container like this:
c234fc3:tests/run.sh:# docker run --rm -v $PWD:$PWD:Z -w $PWD quay.io/coreos/tectonic-builder:bazel-v0.3 bazel build tarball smoke_tests
c234fc3:tests/smoke/glide.lock:- name: github.com/docker/distribution
c234fc3:tests/smoke/glide.lock:- name: github.com/docker/engine-api
c234fc3:tests/smoke/glide.lock:- name: github.com/docker/go-connections
c234fc3:tests/smoke/glide.lock:- name: github.com/docker/go-units

It looks like we could make a few docker login -> podman login changes and bump the doc link to point to Podman's. I dunno if anyone uses images/tectonic-installer/Dockerfile; we could probably just drop that. I'm not sure what Podman's default bridge subnet is. And we could update run.sh to use podman.

# See https://github.com/kubernetes/kubernetes/issues/43292

echo "Starting etcd certificate signer..."

# shellcheck disable=SC2154
SIGNER=$(/usr/bin/docker run -d \
--tmpfs /tmp \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this was basically for security so that the keys/cert never hit the disk

@rhatdan
Copy link

rhatdan commented Sep 13, 2018

Awesome job guys, love to see our dependency on Docker shrink.

wking added a commit to wking/openshift-installer that referenced this pull request Sep 14, 2018
We'd removed this in c234fc3 (*: use podman instead of docker,
2018-09-05, openshift#207) to work around [1,2].  Now that RHCOS is up to:

  $ curl -s http://aos-ostree.rhev-ci-vms.eng.rdu2.redhat.com/rhcos/images/cloud/latest/pkglist.txt | grep runc
   runc-1.0.0-52.dev.git70ca035.el7_5.x86_64

we can restore the option.

[1]: openshift/os#284
[2]: containers/podman#1396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.